Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information
Authors
Abstract
In deduplication-based backup systems, the chunks of each backup are physically scattered after deduplication, which causes a challenging fragmentation problem. Fragmentation decreases restore performance and results in invalid chunks becoming physically scattered across different containers after users delete backups. Existing solutions attempt to rewrite duplicate but fragmented chunks to improve restore performance, and reclaim invalid chunks by identifying and merging valid but fragmented chunks into new containers. However, they cannot accurately identify fragmented chunks due to their limited rewrite buffer. Moreover, the identification of valid chunks is cumbersome and the merging operation is the most time-consuming phase of garbage collection. Our key observation that fragmented chunks remain fragmented in subsequent backups motivates us to propose a History-Aware Rewriting algorithm (HAR). HAR exploits historical information of backup systems to more accurately identify and rewrite fragmented chunks. Since the valid chunks are aggregated in compact containers by HAR, the merging operation is no longer required. To reduce the metadata overhead of garbage collection, we further propose a Container-Marker Algorithm (CMA) that identifies valid containers instead of valid chunks. Our extensive experimental results on real-world datasets show that HAR significantly improves restore performance by 2.6X–17X at a cost of rewriting only 0.45–1.99% of the data, and that CMA reduces the metadata overhead of garbage collection by about 90X.
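The abstract gives no pseudocode, so the following minimal Python sketch only illustrates the general idea as described above: containers that a backup used poorly are remembered as sparse, duplicate chunks that still point into them are rewritten during the next backup, and garbage collection then only tracks which containers are referenced rather than which chunks are valid. The container size, utilization threshold, and the `store`/`fingerprint_index` interfaces are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch (not the authors' implementation) of history-aware
# rewriting (HAR) and container-level marking (CMA). Names, thresholds,
# and interfaces are assumptions.

UTIL_THRESHOLD = 0.5              # assumed cutoff below which a container is "sparse"
CONTAINER_SIZE = 4 * 1024 * 1024  # assumed 4 MiB containers


def har_backup(chunks, fingerprint_index, sparse_from_last_backup, store):
    """Deduplicate one backup, rewriting duplicates stranded in sparse containers."""
    utilization = {}   # container_id -> bytes this backup references in it
    recipe = []        # ordered (fingerprint, container_id) pairs for restore

    for chunk in chunks:
        cid = fingerprint_index.get(chunk.fp)
        if cid is None or cid in sparse_from_last_backup:
            # New chunk, or a duplicate living in a container flagged as
            # sparse by the previous backup: write it into the open container.
            cid = store.append(chunk)
            fingerprint_index[chunk.fp] = cid
        utilization[cid] = utilization.get(cid, 0) + chunk.size
        recipe.append((chunk.fp, cid))

    # Containers this backup used poorly become the "sparse" set that the
    # next backup consults: the historical information HAR exploits.
    sparse_for_next_backup = {
        cid for cid, used in utilization.items()
        if used / CONTAINER_SIZE < UTIL_THRESHOLD
    }
    return recipe, sparse_for_next_backup


def cma_mark(live_recipes):
    """Container-Marker sketch: garbage collection keeps only the set of
    container ids referenced by live backups, instead of per-chunk liveness."""
    return {cid for recipe in live_recipes for _fp, cid in recipe}
```

The sparse set returned for one backup is fed back in as `sparse_from_last_backup` for the next, which is what makes the rewriting decision history-aware in this sketch.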
Similar Resources
An Optimization of Backup Storage using Backup History and Cache Knowledge in reducing Data Fragmentation for In_line deduplication in Distributed
The chunks of data generated by a backup become physically scattered after deduplication in a backup system, which creates a problem known as fragmentation. Fragmentation mainly takes the form of sparse and out-of-order containers. Sparse containers adversely affect restore performance and garbage collection, while out-of-order containers...
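Because this snippet only names sparse and out-of-order containers without defining them, here is a rough, hypothetical way to tell them apart from a backup recipe; the recipe representation and both cutoffs are our own assumptions.

```python
# Hypothetical classifier: given a backup recipe as an ordered list of
# (chunk_size, container_id) pairs, flag sparse and out-of-order containers.

CONTAINER_SIZE = 4 * 1024 * 1024   # assumed container size
SPARSE_CUTOFF = 0.25               # assumed: <25% of the container still referenced
SPREAD_CUTOFF = 1000               # assumed: references spread over >1000 recipe slots


def classify_containers(recipe):
    referenced = {}   # container_id -> bytes still referenced by this backup
    first_ref = {}    # container_id -> first position in the recipe
    last_ref = {}     # container_id -> last position in the recipe

    for pos, (size, cid) in enumerate(recipe):
        referenced[cid] = referenced.get(cid, 0) + size
        first_ref.setdefault(cid, pos)
        last_ref[cid] = pos

    sparse = {c for c, used in referenced.items()
              if used / CONTAINER_SIZE < SPARSE_CUTOFF}
    out_of_order = {c for c in referenced
                    if c not in sparse
                    and last_ref[c] - first_ref[c] > SPREAD_CUTOFF}
    return sparse, out_of_order
```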
A Cost-efficient Rewriting Scheme to Improve Restore Performance in Deduplication Systems
In chunk-based deduplication systems, logically consecutive chunks are physically scattered across different containers after deduplication, which results in a serious fragmentation problem. Fragmentation significantly reduces restore performance because the scattered chunks must be read from many different containers. Existing work aims to rewrite the fragmented duplicate chunks into new containers...
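To make "reading the scattered chunks from different containers" measurable, restore cost is often estimated by replaying the recipe against a small container cache and counting container reads per unit of restored data. The sketch below is such a generic estimator, not the rewriting scheme this paper proposes; the cache capacity is an assumed value.

```python
from collections import OrderedDict

CACHE_CONTAINERS = 64   # assumed LRU cache capacity, in containers


def containers_read_per_mib(recipe):
    """Replay a recipe of (chunk_size, container_id) pairs through an LRU
    container cache and return container reads per MiB restored (lower is better)."""
    cache = OrderedDict()   # LRU over container ids
    reads = 0
    restored = 0

    for size, cid in recipe:
        restored += size
        if cid in cache:
            cache.move_to_end(cid)          # cache hit: refresh recency
        else:
            reads += 1                      # miss: the whole container is fetched
            cache[cid] = True
            if len(cache) > CACHE_CONTAINERS:
                cache.popitem(last=False)   # evict the least recently used container
    return reads / max(restored / (1024 * 1024), 1e-9)
```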
A Network Differential Backup and Restore System based on a Novel Duplicate Data Detection algorithm
The ever-growing volume and value of data have raised increasing pressure for long-term data protection in storage systems. Moreover, the redundancy in data further aggravates this pressure. Protecting data while eliminating redundancy, and thereby saving storage space and network bandwidth, has become a serious challenge. Data deduplication techniques greatly optimize storage ...
Improving restore speed for backup systems that use inline chunk-based deduplication
Slow restoration due to chunk fragmentation is a serious problem facing inline chunk-based data deduplication systems: restore speeds for the most recent backup can drop by orders of magnitude over the lifetime of a system. We study three techniques for alleviating this problem: increasing cache size, container capping, and using a forward assembly area. Container capping is an ingest-time operation...
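As we understand the forward assembly area technique, the restorer reads ahead in the backup recipe, reserves a fixed-size output buffer, and fills all slots served by a given container in one visit, so each container is read at most once per buffer's worth of data. A minimal sketch under that reading; the buffer size and the `store.read_chunk` accessor are assumptions, not details from the paper.

```python
FAA_BYTES = 128 * 1024 * 1024   # assumed size of the forward assembly area


def restore_with_faa(recipe, store, out):
    """Restore a backup by filling a fixed-size assembly buffer span by span.

    `recipe` is an ordered list of (fingerprint, chunk_size, container_id) and
    `store.read_chunk(container_id, fingerprint)` is an assumed accessor that
    returns the chunk bytes.
    """
    i = 0
    while i < len(recipe):
        # 1) Plan one assembly area's worth of chunks, grouped by container.
        plan = {}          # container_id -> list of (buffer_offset, fingerprint, size)
        filled, offset = 0, 0
        j = i
        while j < len(recipe) and (j == i or filled + recipe[j][1] <= FAA_BYTES):
            fp, size, cid = recipe[j]
            plan.setdefault(cid, []).append((offset, fp, size))
            offset += size
            filled += size
            j += 1

        # 2) Visit each needed container once, copying its chunks into place.
        buf = bytearray(filled)
        for cid, slots in plan.items():
            for off, fp, size in slots:
                buf[off:off + size] = store.read_chunk(cid, fp)

        # 3) Flush the assembled span and advance to the next one.
        out.write(bytes(buf))
        i = j
```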
Similarity Based Deduplication with Small Data Chunks
Large backup and restore systems may hold a petabyte or more of data in their repository. Such systems are often compressed by means of deduplication techniques that partition the input text into chunks and store recurring chunks only once. One approach is to use hashing methods to store fingerprints for each data chunk, detecting identical chunks with a very low probability of collisions...
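A hedged sketch of the hashing approach described above: chunk the stream, fingerprint each chunk, and store only first occurrences. SHA-1 fingerprints, fixed-size chunking, and the `index`/`store` interfaces are illustrative choices here, not details taken from that paper.

```python
import hashlib

CHUNK_SIZE = 8 * 1024   # assumed fixed-size chunking, for simplicity


def dedup_stream(stream, index, store):
    """Split a byte stream into chunks, store each unique chunk once, and
    return a recipe of fingerprints from which the stream can be rebuilt.

    `index` maps fingerprint -> storage handle; `store.put(fp, chunk)` is an
    assumed method that persists the chunk and returns its handle.
    """
    recipe = []
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        fp = hashlib.sha1(chunk).hexdigest()   # fingerprint of the chunk
        if fp not in index:                    # first occurrence: persist it
            index[fp] = store.put(fp, chunk)
        recipe.append(fp)
    return recipe
```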